Abstract

Predicting the nodes of a given graph is a fascinating theoretical problem with applications in several 
domains. Since graph sparsification via spanning trees retains enough information while making the task 
much easier, trees are an important special case of this problem. Although it is known how to predict the 
nodes of an unweighted tree in a nearly optimal way, in the weighted case a fully satisfactory algorithm 
is not available yet. We fill this hole and introduce an efficient node predictor, Shazoo, which is nearly 
optimal on any weighted tree. Moreover, we show that Shazoo can be viewed as a common nontrivial 
generalization of both previous approaches for unweighted trees and weighted lines. Experiments on 
real-world datasets confirm that Shazoo performs well in that it fully exploits the structure of the input 
tree, and gets very close to (and sometimes better than) less scalable energy minimization methods. 



1 Introduction 

Predictive analysis of networked data is a fast-growing research area whose application domains include 
document networks, online social networks, and biological networks. In this work we view networked 
data as weighted graphs, and focus on the task of node classification in the transductive setting, i.e., when 
the unlabeled graph is available beforehand. Standard transductive classification methods, such as label 
propagation [2, 19], work by optimizing a cost or energy function defined on the graph, which includes
the training information as labels assigned to training nodes. Although these methods perform well in 
practice, they are often computationally expensive, and have performance guarantees that require statistical 
assumptions on the selection of the training nodes. 

A general approach to sidestep the above computational issues is to sparsify the graph to the largest 
possible extent, while retaining much of its spectral properties (see, e.g., [6, 13, 17]). Inspired by [5, 6],
this paper reduces the problem of node classification from graphs to trees by extracting suitable spanning
trees of the graph, which can be done quickly in many cases. The advantage of performing this reduction 
is that node prediction is much easier on trees than on graphs. This fact has recently led to the design 
of very scalable algorithms with nearly optimal performance guarantees in the online transductive model, 
which comes with no statistical assumptions. Yet, the current results in node classification on trees are 
not satisfactory. The TreeOpt strategy of [5] is optimal to within constant factors, but only on unweighted
trees. No equivalent optimality results are available for general weighted trees. To the best of our knowledge,
the only other comparable result is WTA by [6], which is optimal (within log factors) only on weighted lines.
In fact, WTA can still be applied to weighted trees by exploiting an idea contained in [10]. This is based
on linearizing the tree via a depth-first visit. Since linearization loses most of the structural information 
of the tree, this approach yields suboptimal mistake bounds. This theoretical drawback is also confirmed 
by empirical performance: throwing away the tree structure negatively affects the practical behavior of the 
algorithm on real-world weighted graphs. 

The importance of weighted graphs, as opposed to unweighted ones, is suggested by many practical 
scenarios where the nodes carry more information than just labels, e.g., vectors of feature values. A natural 
way of leveraging this side information is to set the weight on the edge linking two nodes to be some function 
of the similarity between the vectors associated with these nodes. In this work, we bridge the gap between
the weighted and unweighted cases by proposing a new prediction strategy, called SHAZOO, achieving a 
mistake bound that depends on the detailed structure of the weighted tree. We carry out the analysis using a 
notion of learning bias different from the one used in [6] and more appropriate for weighted graphs. More 
precisely, we measure the regularity of the unknown node labeling via the weighted cutsize induced by the 
labeling on the tree (see Section 3 for a precise definition). This replaces the unweighted cutsize that was
used in the analysis of WTA. When the weighted cutsize is used, a cut edge violates this inductive bias in 
proportion to its weight. This modified bias does not prevent a fair comparison between the old algorithms 
and the new one: Shazoo specializes to TreeOpt in the unweighted case, and to WTA when the input
tree is a weighted line. By specializing Shazoo's analysis to the unweighted case we recover TreeOpt's 
optimal mistake bound. When the input tree is a weighted line, we recover WTA's mistake bound expressed 
through the weighted cutsize instead of the unweighted one. The effectiveness of SHAZOO on any tree is 
guaranteed by a corresponding lower bound (see Section 3).

Shazoo can be viewed as a common nontrivial generalization of both TreeOpt and WTA. Obtaining
this generalization while retaining and extending the optimality properties of the two algorithms is far from 
being trivial from a conceptual and technical standpoint. Since SHAZOO works in the online transductive 
model, it can easily be applied to the more standard train/test (or "batch") transductive setting: one simply 
runs the algorithm on an arbitrary permutation of the training nodes, and obtains a predictive model for 
all test nodes. However, the implementation might take advantage of knowing the set of training nodes 
beforehand. For this reason, we present two implementations of SHAZOO, one for the online and one for the 
batch setting. Both implementations result in fast algorithms. In particular, the batch one is linear in |V|. 
This is achieved by a fast algorithm for weighted cut minimization on trees, a procedure which lies at the 
heart of Shazoo. 

Finally, we test SHAZOO against WTA, label propagation, and other competitors on real-world weighted 
graphs. In almost all cases (as expected), we report improvements over WTA due to the better sensitivity 
to the graph structure. In some cases, we see that SHAZOO even outperforms standard label propagation 
methods. Recall that label propagation has a running time per prediction which is proportional to |E|, where
E is the graph edge set. On the contrary, SHAZOO can typically be run in constant amortized time per
prediction by using Wilson's algorithm for sampling random spanning trees [18]. By disregarding edge
weights in the initial sampling phase, this algorithm is able to draw a random (unweighted) spanning tree
in time proportional to |V| on most graphs. Our experiments reveal that using the edge weights only in the
subsequent prediction phase causes in practice only a minor performance degradation.

2 Preliminaries and basic notation 

Let T = (V, E, W) be an undirected and weighted tree with |V| = n nodes, positive edge weights $W_{i,j} > 0$
for $(i,j) \in E$, and $W_{i,j} = 0$ for $(i,j) \notin E$. A binary labeling of T is any assignment
$y = (y_1, \ldots, y_n) \in \{-1,+1\}^n$ of binary labels to its nodes. We use (T, y) to denote the resulting labeled weighted tree. The
online learning protocol for predicting (T, y) is defined as follows. The learner is given T while y is kept
hidden. The nodes of T are presented to the learner one by one, according to an unknown and arbitrary
permutation $i_1, \ldots, i_n$ of V. At each time step $t = 1, \ldots, n$ node $i_t$ is presented and the learner must issue
a prediction $\hat{y}_{i_t} \in \{-1,+1\}$ for the label $y_{i_t}$. Then $y_{i_t}$ is revealed and the learner knows whether a mistake
occurred. The learner's goal is to minimize the total number of prediction mistakes.

Following previous works [11, 10, 5, 6, 17], we measure the regularity of a labeling y of T in terms of
$\phi$-edges, where a $\phi$-edge for (T, y) is any $(i,j) \in E$ such that $y_i \neq y_j$. The overall amount of irregularity
in a labeled tree (T, y) is the weighted cutsize $\Phi^W = \sum_{(i,j) \in E^\phi} W_{i,j}$, where $E^\phi \subseteq E$ is the subset of
$\phi$-edges in the tree. We use the weighted cutsize as our learning bias, that is, we want to design algorithms
whose predictive performance scales with $\Phi^W$. Unlike the $\phi$-edge count $\Phi = |E^\phi|$, which is a good measure
of regularity for unweighted graphs, the weighted cutsize takes the edge weight $W_{i,j}$ into account$^1$ when
measuring the irregularity of a $\phi$-edge $(i,j)$. In the sequel, when we measure the distance between any
pair of nodes i and j on the input tree T we always use the resistance distance metric d, that is,
$d(i,j) = \sum_{(r,s) \in \pi(i,j)} \frac{1}{W_{r,s}}$, where $\pi(i,j)$ is the unique path connecting i to j.
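
As a small illustration of the two quantities just defined, the following Python sketch computes the weighted cutsize $\Phi^W$ and the resistance distance $d(i,j)$ on a toy tree. The adjacency-dictionary representation, the function names, and the example data are assumptions made for this illustration only; they are not part of the paper.

```python
def weighted_cutsize(tree, labels):
    """Sum of W[i][j] over edges whose endpoints carry different labels."""
    total = 0.0
    for i, nbrs in tree.items():
        for j, w in nbrs.items():
            # Each undirected edge is stored twice; count it once (node ids must be comparable).
            if i < j and labels[i] != labels[j]:
                total += w
    return total


def resistance_distance(tree, i, j):
    """Sum of 1/W over the unique path from i to j, found by an iterative DFS."""
    stack = [(i, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nxt, w in tree[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + 1.0 / w))
    raise ValueError("nodes are not connected")


# Hypothetical path 0 - 1 - 2 with weights 2.0 and 0.5.
tree = {0: {1: 2.0}, 1: {0: 2.0, 2: 0.5}, 2: {1: 0.5}}
labels = {0: +1, 1: +1, 2: -1}
print(weighted_cutsize(tree, labels), resistance_distance(tree, 0, 2))
```

In this example the only $\phi$-edge is the light edge (1, 2), so the weighted cutsize is small, while the same light edge dominates the resistance distance between the two end nodes.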

3 A lower bound for weighted trees 

In this section we show that the weighted cutsize can be used as a lower bound on the number of online
mistakes made by any algorithm on any tree. In order to do so (and unlike previous papers on this specific
subject; see, e.g., [6]), we need to introduce a more refined notion of adversarial "budget". Given T =
(V, E, W), let $\xi(M)$ be the maximum number of edges of T such that the sum of their weights does not
exceed M, $\xi(M) = \max\{ |E'| : E' \subseteq E, \sum_{(i,j) \in E'} W_{i,j} \le M \}$. We have the following simple lower
bound (all proofs are omitted from this extended abstract).

Theorem 1 For any weighted tree T = (V, E, W) there exists a randomized label assignment to V such
that any algorithm can be forced to make at least $\xi(M)/2$ online mistakes in expectation, while $\Phi^W \le M$.

Specializing [6, Theorem 1] to trees gives the lower bound K/2 under the constraint $\Phi \le K \le |V|$. The
main difference between the two bounds is the measure of label regularity being used: Whereas Theorem 1
uses $\Phi^W$, which depends on the weights, [6, Theorem 1] uses the weight-independent quantity $\Phi$. This
dependence of the lower bound on the edge weights is consistent with our learning bias, stating that a heavy
$\phi$-edge violates the bias more than a light one. Since $\xi$ is nondecreasing, the lower bound implies a number
of mistakes of at least $\xi(\Phi^W)/2$. Note that $\xi(\Phi^W) \ge \Phi$ for any labeled tree (T, y). Hence, whereas a
constraint K on $\Phi$ implies forcing at least K/2 mistakes, a constraint M on $\Phi^W$ allows the adversary to
force a potentially larger number of mistakes.
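
The budget function $\xi(M)$ can be evaluated greedily, since taking the lightest edges first maximizes the number of edges that fit under a total-weight budget. The sketch below is only an illustration of the definition; the function name `xi` and the list-of-weights input are assumptions, not notation from the paper.

```python
def xi(weights, budget):
    """Largest number of edges whose total weight does not exceed `budget`."""
    count, used = 0, 0.0
    for w in sorted(weights):  # lightest edges first
        if used + w > budget:
            break
        used += w
        count += 1
    return count


# Hypothetical edge weights of a small tree.
print(xi([5.0, 0.5, 0.5, 1.0], budget=2.0))
```

With many light edges, $\xi(M)$ can be much larger than the number of edges a unit-weight budget would allow, which is why a constraint on $\Phi^W$ can let the adversary force more mistakes than a constraint on $\Phi$.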

In the next section we describe an algorithm whose mistake bound nearly matches the above lower
bound on any weighted tree when using $\xi(\Phi^W)$ as the measure of label regularity.

1 The weight value $W_{i,j}$ typically encodes the strength of the connection $(i,j)$. In fact, when the nodes of a graph host more
information than just binary labels, e.g., a vector of feature values, then a reasonable choice is to set $W_{i,j}$ to be some (decreasing)
function of the distance between the feature vectors sitting at the two nodes i and j. See also Remark 2.

4 The Shazoo algorithm



In this section we introduce the SHAZOO algorithm, and relate it to previously proposed methods for online 
prediction on unweighted trees (TreeOpt from [5]) and weighted line graphs (WTA from [6]). In fact, 
Shazoo is optimal on any weighted tree, and reduces to TreeOpt on unweighted trees and to WTA on
weighted line graphs. Since TreeOpt and WTA are optimal on any unweighted tree and any weighted line 
graph, respectively, SHAZOO necessarily contains elements of both of these algorithms. 

In order to understand our algorithm, we now define some relevant structures of the input tree T. See 
Figure 1 (left) for an example. These structures evolve over time according to the set of observed labels.
First, we call revealed a node whose label has already been observed by the online learner; otherwise, a 
node is unrevealed. A fork is any unrevealed node connected to at least three different revealed nodes by 
edge-disjoint paths. A hinge node is either a revealed node or a fork. A hinge tree is any component of 
the forest obtained by removing from T all edges incident to hinge nodes; hence any fork or labeled node 
forms a 1-node hinge tree. When a hinge tree H contains only one hinge node, a connection node for H is 
the node contained in H. In all other cases, we call a connection node for H any node outside H which is 
adjacent to a node in H. A connection fork is a connection node which is also a fork. Finally, a hinge line 
is any path connecting two hinge nodes such that no internal node is a hinge node. 
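
To make the fork definition concrete, here is a minimal Python sketch. It relies on the observation that, in a tree, paths from a node to different revealed nodes are edge-disjoint exactly when they leave the node through different incident edges, so it is enough to count the branches that contain a revealed node. The data layout and function names are assumptions for this example, and the quadratic-time search is chosen for clarity rather than speed.

```python
def branch_has_revealed(tree, start, blocked, revealed):
    """True if the branch entered through edge (blocked, start) contains a revealed node."""
    stack, seen = [start], {blocked, start}
    while stack:
        node = stack.pop()
        if node in revealed:
            return True
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def forks(tree, revealed):
    """Unrevealed nodes with at least three branches that each reach a revealed node."""
    result = set()
    for v in tree:
        if v in revealed:
            continue
        branches = sum(branch_has_revealed(tree, u, v, revealed) for u in tree[v])
        if branches >= 3:
            result.add(v)
    return result


# Hypothetical star: center 0 with leaves 1..4 (unit weights).
star = {0: {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0},
        1: {0: 1.0}, 2: {0: 1.0}, 3: {0: 1.0}, 4: {0: 1.0}}
print(forks(star, revealed={1, 2, 3}))
```

Once three leaves of the star are revealed, the center becomes a fork and therefore a hinge node; with only two revealed leaves it is not.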




Figure 1: Left: An input tree. Revealed nodes are dark grey, forks are doubly circled, and hinge lines have
thick black edges. The hinge trees not containing hinge nodes (i.e., the ones that are not singletons) are
enclosed by dotted lines. The dotted arrows point to the connection node(s) of such hinge trees. Middle:
The predictions of SHAZOO on the nodes of a hinge tree. The numbers on the edges denote edge weights. At
a given time t, SHAZOO uses the value of $\Delta$ on the two hinge nodes (the doubly circled ones, which are also
forks in this case), and is required to issue a prediction on node $i_t$ (the black node in this figure). Since $i_t$ is
between a positive $\Delta$ hinge node and a negative $\Delta$ hinge node, SHAZOO goes with the one which is closer
in resistance distance, hence predicting $\hat{y}_{i_t} = -1$. Right: A simple example where the mincut prediction
strategy does not work well in the weighted case. In this example, mincut mispredicts all labels, yet $\Phi = 1$,
and the ratio of $\Phi^W$ to the total weight of all edges is about 1/|V|. The labels to be predicted are presented
according to the numbers on the left of each node. Edge weights are also displayed, where a is a very small
constant.

Given an unrevealed node i and a label value $y \in \{-1,+1\}$, the cut function cut(i, y) is the value of
the minimum weighted cutsize of T over all labelings in $\{-1,+1\}^n$ consistent with the labels seen so far
and such that $y_i = y$. Define $\Delta(i) = \mathrm{cut}(i,-1) - \mathrm{cut}(i,+1)$ if i is unrevealed, and $\Delta(i) = y_i$ otherwise.
The algorithm's pseudocode is given in Algorithm 1. At time t, in order to predict the label $y_{i_t}$ of node $i_t$,
SHAZOO calculates $\Delta(i)$ for all connection nodes i of $H(i_t)$, where $H(i_t)$ is the hinge tree containing $i_t$.
Then the algorithm predicts $y_{i_t}$ using the label of the connection node i of $H(i_t)$ which is closest to $i_t$ and
such that $\Delta(i) \neq 0$ (recall from Section 2 that all distances/lengths are measured using the resistance metric).
Ties are broken arbitrarily. If $\Delta(i) = 0$ for all connection nodes i in $H(i_t)$ then SHAZOO predicts a default
value (-1 in the pseudocode). If $i_t$ is a fork (which is also a hinge node), then $H(i_t) = \{i_t\}$. In this case, $i_t$
is a connection node of $H(i_t)$, and obviously the one closest to itself. Hence, in this case SHAZOO predicts
$y_{i_t}$ simply by $\hat{y}_{i_t} = \mathrm{sgn}(\Delta(i_t))$. See Figure 1 (middle) for an example. On unweighted trees, computing



Algorithm 1: Shazoo

for t = 1 ... n
    Let $C(H(i_t))$ be the set of the connection nodes i of $H(i_t)$ for which $\Delta(i) \neq 0$
    if $C(H(i_t)) \neq \emptyset$
        Let j be the node of $C(H(i_t))$ closest to $i_t$
        Set $\hat{y}_{i_t} = \mathrm{sgn}(\Delta(j))$
    else
        Set $\hat{y}_{i_t} = -1$ (default value)



$\Delta(i)$ for a connection node i reduces to the Fork Label Estimation Procedure in [5, Lemma 13]. On the
other hand, predicting with the label of the connection node closest to $i_t$ in resistance distance is reminiscent
of the nearest-neighbor prediction of WTA on weighted line graphs [6]. In fact, as in WTA, this makes it possible to
take advantage of labelings whose $\phi$-edges are light weighted. An important limitation of WTA is that this
algorithm linearizes the input tree. On the one hand, this greatly simplifies the analysis of nearest-neighbor
prediction; on the other hand, this prevents exploiting the structure of T, thereby causing logarithmic slacks
in the upper bound of WTA. The TreeOpt algorithm, instead, performs better when the unweighted input
tree is very different from a line graph (more precisely, when the input tree cannot be decomposed into long
edge-disjoint paths, e.g., a star graph). Indeed, TreeOpt's upper bound does not suffer from logarithmic
slacks, and is tight up to constant factors on any unweighted tree. Similar to TreeOpt, Shazoo does not
linearize the input tree and extends to the weighted case TreeOpt's superior performance, also confirmed
by the experimental comparison reported in Section 6.
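
The prediction rule of Algorithm 1 can be sketched in a few lines of Python once the connection nodes of $H(i_t)$, their resistance distances to $i_t$, and their $\Delta$ values are available. The sketch below assumes these are passed in as (distance, delta) pairs; that input format and the function names are hypothetical and only serve the illustration.

```python
def sgn(x):
    """Sign of x as -1, 0 or +1."""
    return (x > 0) - (x < 0)


def shazoo_predict(connection_nodes, default=-1):
    """Predict with the sign of Delta at the nearest connection node with Delta != 0.

    connection_nodes: list of (resistance_distance_to_i_t, delta_value) pairs.
    """
    candidates = [(d, delta) for d, delta in connection_nodes if delta != 0]
    if not candidates:
        return default  # default value used when every Delta is zero
    _, delta = min(candidates, key=lambda pair: pair[0])  # ties broken arbitrarily
    return sgn(delta)


# Two hinge nodes with opposite Delta signs; the negative one is closer.
print(shazoo_predict([(0.7, +2.0), (0.4, -1.0)]))
```

A fork is handled by the same rule: it is its own connection node at distance zero, so the prediction is simply the sign of its own $\Delta$.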

In Figure 1 (right) we show an example that highlights the importance of using the $\Delta$ function to compute
the fork labels. Since $\Delta$ predicts a fork $i_t$ with the label that minimizes the weighted cutsize of T consistent
with the revealed labels, one may wonder whether computing $\Delta$ through mincut based on the number of
$\phi$-edges (rather than their weighted sum) could be an effective prediction strategy. Figure 1 (right) illustrates
an example of a simple tree where such a $\Delta$ mispredicts the labels of all nodes, when both $\Phi^W$ and $\Phi$ are
small.

Remark 1 We would like to stress that SHAZOO can also be used to predict the nodes of an arbitrary
graph by first drawing a random spanning tree T of the graph, and then predicting optimally on T (see,
e.g., [6]). The resulting mistake bound is simply the expected value of SHAZOO's mistake bound over
the random draw of T. By using a fast spanning tree sampler [18], the involved computational overhead
amounts to constant amortized time per node prediction on "most" graphs.

Remark 2 In certain real-world input graphs, the presence of an edge linking two nodes may also carry
information about the extent to which the two nodes are dissimilar, rather than similar. This information
can be encoded by the sign of the weight, and the resulting network is called a signed graph. The regularity
measure is naturally extended to signed graphs by counting the weight of frustrated edges, where
$(i,j)$ is frustrated if $y_i y_j \neq \mathrm{sgn}(W_{i,j})$. Many of the existing algorithms for node classification [19, 10,
17, 5, 9, 6] can in principle be run on signed graphs. However, the computational cost may not always
be preserved. For example, mincut is in general NP-hard when the graph is signed [14]. Since our
algorithm sparsifies the graph using trees, it can be run efficiently even in the signed case. We just need to
re-define the $\Delta$ function as $\Delta(i) = \mathrm{fcut}(i,-1) - \mathrm{fcut}(i,+1)$, where fcut is the minimum total weight of
frustrated edges consistent with the labels seen so far. The argument contained in Section 5 for the positive
edge weights (see, e.g., Eq. (1) therein) allows us to show that also this version of $\Delta$ can be computed
efficiently. The prediction rule has to be re-defined as well: We count the parity of the number z of
negative-weighted edges along the path connecting $i_t$ to the closest node $j \in C(H(i_t))$, i.e., $\hat{y}_{i_t} = (-1)^z \mathrm{sgn}(\Delta(j))$.

Remark 3 In [5] the authors note that TreeOpt approximates a version space (Halving) algorithm on the
set of tree labelings. Interestingly, SHAZOO is also an approximation to a more general Halving algorithm
for weighted trees. This generalized Halving gives a weight to each labeling consistent with the labels seen
so far and with the sign of $\Delta(f)$ for each fork f. These weighted labelings, which depend on the weights
of the $\phi$-edges generated by each labeling, are used for computing the predictions. One can show (details
omitted due to space limitations) that this generalized Halving algorithm has a mistake bound within a
constant factor of SHAZOO's.

5 Mistake bound analysis and implementation 

We now show that SHAZOO is nearly optimal on every weighted tree T. We obtain an upper bound in terms
of $\Phi^W$ and the structure of T, nearly matching the lower bound of Theorem 1. We now give some auxiliary
notation that is strictly needed for stating the mistake bound.

Given a labeled tree (T, y), a cluster is any maximal subtree whose nodes have the same label. An
in-cluster line graph is any line graph that is entirely contained in a single cluster. Finally, given a line
graph L, we set $R_L^W = \sum_{(i,j) \in L} \frac{1}{W_{i,j}}$, i.e., the (resistance) distance between its terminal nodes.

Theorem 2 For any labeled and weighted tree (T, y), there exists a set $\mathcal{L}_T$ of $O(\xi(\Phi^W))$ edge-disjoint
in-cluster line graphs such that the number of mistakes made by SHAZOO is at most of the order of

$$\sum_{L \in \mathcal{L}_T} \min\left\{ |L|,\ 1 + \left\lfloor \log\left(1 + \Phi^W R_L^W\right) \right\rfloor \right\}.$$

The above mistake bound depends on the tree structure through $\mathcal{L}_T$. The sum contains $O(\xi(\Phi^W))$ terms,
each one being at most logarithmic in the scale-free products $\Phi^W R_L^W$. The bound is governed by the same
key quantity $\xi(\Phi^W)$ occurring in the lower bound of Theorem 1. However, Theorem 2 also shows that
SHAZOO can take advantage of trees that cannot be covered by long line graphs. For example, if the input
tree T is a weighted line graph, then it is likely to contain long in-cluster lines. Hence, the factor multiplying
$\xi(\Phi^W)$ may be of the order of $\log(1 + \Phi^W R_L^W)$. If, instead, T has constant diameter (e.g., a star graph),
then the in-cluster lines can only contain a constant number of nodes, and the number of mistakes can never
exceed $O(\xi(\Phi^W))$. This is a log factor improvement over WTA which, by its very nature, cannot exploit the
structure of the tree it operates on.$^2$

2 One might wonder whether an arbitrarily large gap between upper (Theorem 2) and lower (Theorem 1) bounds exists due to
the extra factors depending on $\Phi^W R_L^W$. One way to get around this is to follow the analysis of WTA in [6]. Specifically, we can
adapt here the more general analysis from that paper (see Lemma 2 therein) that allows us to drop, for any integer K, the resistance
contribution of K arbitrary non-$\phi$ edges of the line graphs in $\mathcal{L}_T$ (thereby reducing $R_L^W$ for any L containing any of these edges)
at the cost of increasing the mistake bound by K. The details will be given in the full version of this paper.

As for the implementation, we start by describing a method for calculating cut(v, y) for any unlabeled
node v and label value y. Let $T^v$ be the maximal subtree of T rooted at v, such that no internal node is
revealed. For any node i of $T^v$, let $T_i^v$ be the subtree of $T^v$ rooted at i. Let $\Phi_i^v(y)$ be the minimum weighted
cutsize of $T_i^v$ consistent with the revealed nodes and such that $y_i = y$. Since $\Delta(v) = \mathrm{cut}(v,-1) - \mathrm{cut}(v,+1) = \Phi_v^v(-1) - \Phi_v^v(+1)$,
our goal is to compute $\Phi_v^v(y)$. It is easy to see by induction that the
quantity $\Phi_i^v(y)$ can be recursively defined as follows, where $C_i^v$ is the set of all children of i in $T^v$, and
$Y_j = \{y_j\}$ if $y_j$ is revealed, and $Y_j = \{-1,+1\}$ otherwise:$^3$

$$\Phi_i^v(y) = \begin{cases} \sum_{j \in C_i^v} \min_{y' \in Y_j} \left( \Phi_j^v(y') + \mathbb{I}\{y' \neq y\}\, W_{i,j} \right) & \text{if } i \text{ is an internal node of } T^v \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

Now, $\Phi_v^v(y)$ can be computed through a simple depth-first visit of $T^v$. In all backtracking steps of this visit
the algorithm uses (1) to compute $\Phi_i^v(y)$ for each node i, the values $\Phi_j^v(y)$ for all children j of i being
calculated during the previous backtracking steps. The total running time is therefore linear in the number
of nodes of $T^v$.
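
The recursion for the cut function translates into a short depth-first computation. The following Python sketch is one possible rendering under stated assumptions: the tree is an adjacency dictionary, `revealed` maps revealed nodes to their labels, and the function names are hypothetical. The visit stops at revealed nodes, which act as the leaves of the subtree rooted at the query node, and returns the costs for both labels at once so that each node is visited only once.

```python
def subtree_costs(tree, revealed, i, parent=None):
    """Return {y: min weighted cutsize of the subtree rooted at i with label y at i}."""
    if parent is not None and i in revealed:
        return {revealed[i]: 0.0}  # revealed nodes are leaves with a forced label
    cost = {-1: 0.0, +1: 0.0}
    for j, w in tree[i].items():
        if j == parent:
            continue
        child = subtree_costs(tree, revealed, j, i)
        for y in cost:
            # Best label for child j, paying W[i][j] if the edge (i, j) is cut.
            cost[y] += min(c + (w if yj != y else 0.0) for yj, c in child.items())
    return cost


def delta(tree, revealed, v):
    """Delta(v) = cut(v, -1) - cut(v, +1) for unrevealed v, and y_v otherwise."""
    if v in revealed:
        return revealed[v]
    cost = subtree_costs(tree, revealed, v)
    return cost[-1] - cost[+1]


# Hypothetical path a - v - b: a is +1 (weight 1), b is -1 (weight 3).
tree = {"a": {"v": 1.0}, "v": {"a": 1.0, "b": 3.0}, "b": {"v": 3.0}}
print(delta(tree, {"a": +1, "b": -1}, "v"))
```

In the example, labeling v with +1 cuts the heavy edge while labeling it with -1 cuts the light one, so $\Delta(v)$ is negative and the sign agrees with the neighbor attached by the heavier edge. The recursive form is used for readability; on deep trees an explicit stack would avoid recursion limits.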

Next, we describe the basic implementation of SHAZOO for the online setting. A batch learning
implementation will be given at the end of this section. The online implementation is made up of three steps.

1. Find the hinge nodes of subtree $T^{i_t}$. Recall that a hinge node is either a fork or a revealed node.
Observe that a fork is incident to at least three nodes lying on different hinge lines. Hence, in this step we
perform a depth-first visit of $T^{i_t}$, marking each node lying on a hinge line. In order to accomplish this task,
it suffices to single out all forks marking each labeled node and, recursively, each parent of a marked node
of $T^{i_t}$. At the end of this process we are able to single out the forks by counting the number of edges $(i,j)$
of each marked node i such that j has been marked, too. The remaining hinge nodes are the leaves of $T^{i_t}$
whose labels have currently been revealed.

2. Compute $\mathrm{sgn}(\Delta(i))$ for all connection forks of $H(i_t)$. From the previous step we can easily find
the connection node(s) of $H(i_t)$. Then, we simply exploit the above-described technique for computing the
cut function, obtaining $\mathrm{sgn}(\Delta(i))$ for all connection forks i of $H(i_t)$.

3. Propagate the labels of the nodes of $C(H(i_t))$ (only if $i_t$ is not a fork). We perform a visit of $H(i_t)$
starting from every node $r \in C(H(i_t))$. During these visits, we mark each node j of $H(i_t)$ with the label
of r computed in the previous step, together with the length of $\pi(r,j)$, which is what we need for predicting
any label of $H(i_t)$ at the current time step.

The overall running time is dominated by the first step and the calculation of $\Delta(i)$. Hence the worst case
running time is proportional to $\sum_{t \le |V|} |V(T^{i_t})|$. This quantity can be quadratic in |V|, though this is rarely
encountered in practice if the node presentation order is not adversarial. For example, it is easy to show that
in a line graph, if the node presentation order is random, then the total time is of the order of |V| log |V|.
For a star graph the total time complexity is always linear in |V|, even on adversarial orders.

In many real-world scenarios, one is interested in the more standard problem of predicting the labels
of a given subset of test nodes based on the available labels of another subset of training nodes. Building
on the above online implementation, we now derive an implementation of SHAZOO for this train/test (or
"batch learning") setting. We first show that computing $\Phi_i^i(+1)$ and $\Phi_i^i(-1)$ for all unlabeled nodes i in
T takes O(|V|) time. This allows us to compute $\mathrm{sgn}(\Delta(v))$ for all forks v in O(|V|) time, and then use the
first and the third steps of the online implementation. Overall, we show that predicting all labels in the test
set takes O(|V|) time.

Consider tree $T^i$ as rooted at i. Given any unlabeled node i, we perform a visit of $T^i$ starting at i.
During the backtracking steps of this visit we use (1) to calculate $\Phi_j^i(y)$ for each node j in $T^i$ and label
$y \in \{-1,+1\}$. Observe now that for any pair i, j of adjacent unlabeled nodes and any label $y \in \{-1,+1\}$,
once we have obtained $\Phi_i^i(y)$, $\Phi_j^i(+1)$, and $\Phi_j^i(-1)$, we can compute $\Phi_i^j(y)$ in constant time, as
$\Phi_i^j(y) = \Phi_i^i(y) - \min_{y' \in \{-1,+1\}} \left( \Phi_j^i(y') + \mathbb{I}\{y' \neq y\}\, W_{i,j} \right)$. In fact, all children of j in $T^i$ are descendants of i, while
the children of i in $T^i$ (but j) are descendants of j in $T^j$. Once SHAZOO computes $\Phi_i^i(y)$, we can compute in
constant time $\Phi_i^j(y)$ for all child nodes j of i in $T^i$, and use this value for computing $\Phi_j^j(y)$. Generalizing
this argument, it is easy to see that in the next phase we can compute $\Phi_k^k(y)$ in constant time for all nodes
k of $T^i$ such that for all ancestors u of k and all $y \in \{-1,+1\}$, the values of $\Phi_u^u(y)$ have previously been
computed.

3 The recursive computations contained in this section are reminiscent of the sum-product algorithm [12].

The time for computing $\Phi_s^s(y)$ for all nodes s of $T^i$ and any label y is therefore linear in the time of
performing a breadth-first (or depth-first) visit of $T^i$, i.e., linear in the number of nodes of $T^i$. Since each
labeled node with degree d is part of at most d trees $T^i$ for some i, we have that the total number of nodes
of all distinct (edge-disjoint) trees $T^i$ across $i \in V$ is linear in |V|.

Finally, we need to propagate the connection node labels of each hinge tree as in the third step of 
the online implementation. Since also this last step takes linear time, we conclude that the total time for 
predicting all labels is linear in |V|.

6 Experiments 

We tested our algorithm on a number of real-world weighted graphs from different domains (character 
recognition, text categorization, bioinformatics, Web spam detection) against the following baselines: 

Online Majority Vote (OMV). This is an intuitive and fast algorithm that sequentially predicts the
node labels via a weighted majority vote over the labels of the adjacent nodes seen so far. Namely, OMV
predicts $y_{i_t}$ through the sign of $\sum_s y_{i_s} W_{i_s,i_t}$, where s ranges over $s < t$ such that $(i_s, i_t) \in E$. Both the total
time and space required by OMV are O(|E|).
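
For concreteness, the OMV baseline can be sketched as follows. The adjacency-dictionary input, the function name, and the default prediction of -1 when the vote is zero (for example, when no neighbor has been seen yet) are assumptions of this sketch; the text above does not specify a tie-breaking rule.

```python
def omv_predict_sequence(graph, order, labels, default=-1):
    """Predict each node in `order` by a weighted vote of its already-revealed neighbours."""
    seen = {}
    predictions = []
    for node in order:
        vote = sum(w * seen[nbr] for nbr, w in graph[node].items() if nbr in seen)
        predictions.append(1 if vote > 0 else -1 if vote < 0 else default)
        seen[node] = labels[node]  # the true label is revealed after the prediction
    return predictions


# Hypothetical weighted triangle.
graph = {0: {1: 1.0, 2: 0.5}, 1: {0: 1.0, 2: 2.0}, 2: {0: 0.5, 1: 2.0}}
labels = {0: +1, 1: -1, 2: -1}
print(omv_predict_sequence(graph, [0, 1, 2], labels))
```

Each edge is inspected at most twice over the whole sequence, which matches the O(|E|) total time stated above.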

Label Propagation (LabProp). LabProp [19, 2, 3] is a batch transductive learning method computed
by solving a system of linear equations which requires total time of the order of $|E| \times |V|$. This relatively
high computational cost should be taken into account when comparing LabProp to faster online algorithms.
Recall that OMV can be viewed as a fast "online approximation" to LabProp.

Weighted Tree Algorithm (WTA). As explained in the introductory section, WTA can be viewed as a 
special case of SHAZOO. When the input graph is not a line, WTA turns it into a line by first extracting a 
spanning tree of the graph, and then linearizing it. The implementation described in [6] runs in constant
amortized time per prediction whenever the spanning tree sampler runs in time O(|V|).

The Graph Perceptron algorithm [11] is another readily available baseline. This algorithm has been
excluded from our comparison because it does not seem to be very competitive in terms of performance
(see, e.g., [6]), and is also computationally expensive.

In our experiments, we combined SHAZOO and WTA with spanning trees generated in different ways 
(note that OMV and LabProp do not need to extract spanning trees from the input graph). 

Random Spanning Tree (RST). Following Ch. 4 of [13], we draw a weighted spanning tree with
probability proportional to the product of its edge weights. We also tested our algorithms combined with
random spanning trees generated uniformly at random ignoring the edge weights (i.e., the weights were
only used to compute predictions on the randomly generated tree); we call these spanning trees NWRST
(no-weight RST). On most graphs, this procedure can be run in time linear in the number of nodes [18].
Hence, the combinations SHAZOO+NWRST and WTA+NWRST run in O(|V|) time on most graphs.
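
A uniform (no-weight) random spanning tree can be drawn with Wilson's loop-erased random walk procedure. The Python sketch below is an illustrative rendering, not the implementation used in the experiments; it assumes a connected unweighted graph given as adjacency lists, and the function name and the explicit `rng` argument are choices made for this example.

```python
import random


def wilson_spanning_tree(adj, root, rng):
    """Uniform random spanning tree of a connected graph, as a list of (node, successor) edges."""
    in_tree = {root}
    successor = {}
    for start in adj:
        node = start
        while node not in in_tree:  # random walk until the current tree is hit
            successor[node] = rng.choice(adj[node])
            node = successor[node]
        node = start
        while node not in in_tree:  # retrace; overwritten successors erase the loops
            in_tree.add(node)
            node = successor[node]
    return [(node, successor[node]) for node in adj if node != root]


# Hypothetical 4-cycle with one chord.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(wilson_spanning_tree(adj, root=0, rng=random.Random(0)))
```

Every returned edge points from a node toward the root, so the output can be used directly as a parent map. Edge weights are ignored during sampling and would be reattached afterwards for the prediction phase.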

Minimum Spanning Tree (MST). This is the spanning tree minimizing the sum of the resistors on its
edges. This tree best approximates the original graph in terms of the trace norm distance of the
corresponding Laplacian matrices.
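
One standard way to obtain such a tree is Kruskal's algorithm applied to the resistances $1/W_{i,j}$. The sketch below is a minimal version under assumptions made for this example: nodes are numbered 0 to n-1, edges are given as (u, v, w) triples with positive weights, and the function name is hypothetical.

```python
def min_resistance_spanning_tree(n, edges):
    """Kruskal's algorithm on edges (u, v, w), minimising the sum of resistances 1/w."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(edges, key=lambda e: 1.0 / e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two different components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


# Hypothetical weighted triangle: the lightest edge (1, 2) is left out.
print(min_resistance_spanning_tree(3, [(0, 1, 4.0), (1, 2, 1.0), (0, 2, 2.0)]))
```

Because Kruskal's algorithm depends only on the ordering of the edges, sorting by increasing resistance is the same as sorting by decreasing weight.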

Following [6], we also ran SHAZOO and WTA using committees of spanning trees, and then
aggregating predictions via a majority vote. The resulting algorithms are denoted by k*SHAZOO and k*WTA,
where k is the number of spanning trees in the aggregation. We used either k = 7, 11 or k = 3, 7, depending
on the dataset size.
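
The committee aggregation amounts to a per-node majority vote over the predictions obtained on the k trees. A minimal sketch follows, assuming each tree's predictions are stored as a dictionary from node to a label in {-1, +1}; the function name is hypothetical, and the fallback to -1 on a zero sum is an assumption that never triggers when k is odd.

```python
def committee_predict(tree_predictions):
    """Majority vote, node by node, over the +1/-1 predictions made on k spanning trees."""
    nodes = tree_predictions[0].keys()
    return {
        v: (1 if sum(pred[v] for pred in tree_predictions) > 0 else -1)
        for v in nodes
    }


# Hypothetical predictions of a committee of k = 3 trees on two nodes.
votes = [{"a": +1, "b": -1}, {"a": +1, "b": +1}, {"a": -1, "b": -1}]
print(committee_predict(votes))
```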

For our experiments, we used five datasets: RCV1, USPS, KROGAN, COMBINED, and WEBSPAM.
WEBSPAM is a big dataset (110,900 nodes and 1,836,136 edges) of inter-host links created for the Web
Spam Challenge 2008 [16]. KROGAN (2,169 nodes and 6,102 edges) and COMBINED (2,871 nodes
and 6,407 edges) are high-throughput protein-protein interaction networks of budding yeast taken from [15]
(see [6] for a more complete description). Finally, USPS and RCV1 are graphs obtained from the USPS
handwritten characters dataset (all ten categories) and the first 10,000 documents in chronological order of
Reuters Corpus Vol. 1 (the four most frequent categories), respectively. In both cases, we used Euclidean
10-Nearest Neighbor to create edges, each weight $W_{i,j}$ being equal to $e^{-\|x_i - x_j\|^2 / \sigma_{i,j}^2}$. We set
$\sigma_{i,j}^2 = \frac{1}{2}(\sigma_i^2 + \sigma_j^2)$, where $\sigma_i^2$ is the average squared distance between i and its 10 nearest neighbours.
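
The edge-weight construction can be sketched as follows. This is an illustration under assumptions that the text does not spell out: an edge is created whenever one endpoint is among the k nearest neighbours of the other, points are distinct tuples of floats, and neighbours are found by brute force. The function name is hypothetical.

```python
import math


def knn_gaussian_weights(points, k):
    """Weights exp(-||x_i - x_j||^2 / s2_ij) on a k-NN graph, with s2_ij = (s2_i + s2_j) / 2."""
    n = len(points)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours, sigma2 = [], []
    for i in range(n):
        ranked = sorted((sqdist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        neighbours.append([j for _, j in ranked])
        sigma2.append(sum(d for d, _ in ranked) / k)  # average squared distance to the k nearest

    weights = {}
    for i in range(n):
        for j in neighbours[i]:
            s2 = 0.5 * (sigma2[i] + sigma2[j])
            weights[(min(i, j), max(i, j))] = math.exp(-sqdist(points[i], points[j]) / s2)
    return weights


# Three hypothetical points on a line, with k = 1 instead of 10 to keep the example small.
print(knn_gaussian_weights([(0.0,), (1.0,), (3.0,)], k=1))
```

The local scale $\sigma_{i,j}^2$ makes the weights comparable across dense and sparse regions of the data, so an edge between two points in a sparse region is not penalized merely for being long.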

Following previous experimental settings [6], we associate binary classification tasks with the five
datasets/graphs via a standard one-vs-all reduction. Each error rate is obtained by averaging over ten
randomly chosen training sets (and ten different trees in the case of RST and NWRST). WEBSPAM is natively a
binary classification problem, and we used the same train/test split provided with the dataset: 3,897 training
nodes and 1,993 test nodes (the remaining nodes being unlabeled).

In the table below, we show the macro-averaged classification error rates (percentages) achieved by the
various algorithms on the first four datasets mentioned in the main text. For each dataset we trained ten times
over a random subset of 5%, 10% and 25% of the total number of nodes and tested on the remaining ones.
In boldface are the lowest error rates in each column, excluding LabProp, which is used as a "yardstick"
comparison. Standard deviations averaged over the binary problems are small: most of the time less than
0.5%.
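To make the reported quantity concrete, the sketch below computes a macro-averaged error rate over the binary tasks produced by a one-vs-all reduction: each class gives one "class vs. rest" problem, and the per-problem error rates are averaged with equal weight. The function name and input layout are illustrative assumptions.

```python
def macro_averaged_error(true_classes, binary_predictions):
    """Macro-averaged error (in percent) over one-vs-all binary tasks.

    true_classes: list of class identifiers, one per test node.
    binary_predictions: {class c: list of +1/-1 predictions for 'c vs. rest'}.
    """
    rates = []
    for c, preds in binary_predictions.items():
        # One-vs-all reduction: +1 for class c, -1 for every other class.
        truth = [1 if t == c else -1 for t in true_classes]
        errors = sum(1 for y, p in zip(truth, preds) if y != p)
        rates.append(errors / len(truth))
    return 100.0 * sum(rates) / len(rates)
```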



Datasets         |        USPS          |        RCV1          |       KROGAN         |      COMBINED
Predictors       |     5%    10%    25% |     5%    10%    25% |     5%    10%    25% |     5%    10%    25%
-----------------+----------------------+----------------------+----------------------+---------------------
SHAZOO+RST       |   3.62   2.82   2.02 |  21.72  18.70  15.68 |  18.11  17.68  17.10 |  17.77  17.24  17.34
SHAZOO+NWRST     |   3.88   3.03   2.18 |  21.97  19.21  15.95 |  18.11  18.14  17.32 |  17.22  17.21  17.53
SHAZOO+MST       |   1.07   0.96   0.80 |  17.71  14.87  11.73 |  17.46  16.92  16.30 |  16.79  16.64  17.15
WTA+RST          |   5.34   4.23   3.02 |  25.53  22.66  19.05 |  21.82  21.05  20.08 |  21.76  21.38  20.26
WTA+NWRST        |   5.74   4.45   3.26 |  25.50  22.70  19.24 |  21.90  21.28  20.18 |  21.58  21.42  20.64
WTA+MST          |   1.81   1.60   1.21 |  21.07  17.94  13.92 |  21.41  20.63  19.61 |  21.74  21.20  20.32
7*SHAZOO+RST     |   1.68   1.28   0.97 |  16.33  13.52  11.07 |  15.54  15.58  15.46 |  15.12  15.24  15.84
7*SHAZOO+NWRST   |   1.89   1.38   1.06 |  16.49  13.98  11.37 |  15.61  15.62  15.50 |  15.02  15.12  15.80
7*WTA+RST        |   2.10   1.56   1.14 |  17.44  14.74  12.15 |  16.75  16.64  15.88 |  16.42  16.09  15.72
7*WTA+NWRST      |   2.33   1.73   1.24 |  17.69  15.18  12.53 |  16.71  16.60  16.00 |  16.24  16.13  15.79
11*SHAZOO+RST    |   1.52   1.17   0.89 |  15.82  13.04  10.59 |  15.36  15.40  15.29 |  14.91  15.06  15.61
11*SHAZOO+NWRST  |   1.70   1.27   0.98 |  15.95  13.42  10.93 |  15.40  15.33  15.32 |  14.87  14.99  15.67
11*WTA+RST       |   1.84   1.36   1.01 |  16.40  13.95  11.42 |  16.20  16.15  15.53 |  15.90  15.58  15.30
11*WTA+NWRST     |   2.04   1.51   1.12 |  16.70  14.28  11.68 |  16.22  16.05  15.50 |  15.74  15.57  15.33
OMV              |  24.79  12.34   2.10 |  31.65  22.35  11.79 |  43.13  38.75  29.84 |  44.72  40.86  33.24
LabProp          |   1.95   1.11   0.82 |  16.28  12.99  10.00 |  15.56  14.98  15.23 |  14.79  14.93  15.18



Next, we extract from the above table a specific comparison among SHAZOO, WTA, and LabProp. SHAZOO
and WTA use a single minimum spanning tree (the best performing tree type for both algorithms). Note
that SHAZOO consistently outperforms WTA.

[Figure: error rates of SHAZOO+MST, WTA+MST, and LABPROP on USPS, RCV1, KROGAN, and COMBINED.]

We then report the results on WEBSPAM. SHAZOO and WTA use only non-weighted random spanning trees
(NWRST) to optimize scalability. Since this dataset is extremely unbalanced (5.4% positive labels) we use 
the average test set F-measure instead of the error rate. 
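The sketch below shows why the F-measure is the more informative score on such unbalanced data. It assumes the balanced F1 variant (harmonic mean of precision and recall on the positive class), which the text does not spell out; the function name is illustrative.

```python
def f_measure(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == positive and p != positive)
    if tp == 0:
        return 0.0                # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A predictor that labels every node negative would have a low error rate when positives are rare, yet its F-measure under this definition is zero.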



Predictors   SHAZOO   WTA     OMV     LabProp   3*WTA   3*SHAZOO   7*WTA   7*SHAZOO
F-measure    0.954    0.947   0.706   0.931     0.967   0.964      0.968   0.968



4 We do not compare our results to those obtained within the challenge since we are only exploiting the graph (weighted) 
topology here, disregarding content features.

Our empirical results can be briefly summarized as follows:

1. Without using committees, SHAZOO outperforms WTA on all datasets, irrespective of the type of
spanning tree being used. With committees, SHAZOO works better than WTA almost always, although the
gap between the two narrows.

2. The predictive performance of SHAZOO+MST is comparable to, and sometimes better than, that of
LabProp, though the latter algorithm is slower.

3. k*SHAZOO, with k = 11 (or k = 7 on WEBSPAM), seems to be especially effective, outperforming
LabProp with a small (e.g., 5%) training set size.

4. NWRST does not offer the same theoretical guarantees as RST, but it is extremely fast to generate
(linear in |V| on most graphs; see, e.g., [1]), and in our experiments is only slightly inferior to RST.



References 

[1] N. Alon, C. Avin, M. Koucky, G. Kozma, Z. Lotker, and M.R. Tuttle. Many random walks are faster 
than one. In Proc. 20th Symp. on Parallel Algo. and Architectures, pages 119-128. Springer, 2008. 

[2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. 
In Proceedings of the 17th Annual Conference on Learning Theory, pages 624-638. Springer, 2004. 

[3] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In Semi- 
Supervised Learning, pages 193-216. MIT Press, 2006. 

[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceed- 
ings of the 18th International Conference on Machine Learning. Morgan Kaufmann, 2001. 

[5] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction of a labeled tree. In Proceedings
of the 22nd Annual Conference on Learning Theory, 2009. 

[6] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of 
weighted graphs. In Proceedings of the 27th International Conference on Machine Learning, 2010. 

[7] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Active learning on trees and graphs. In Proceedings
of the 23rd Conference on Learning Theory (COLT 2010), 2010.

[8] G. Iacono and C. Altafini. Monotonicity, frustration, and ordered response: an analysis of the energy
landscape of perturbed large-scale biological networks. BMC Systems Biology, 4(83), 2010. 

[9] M. Herbster and G. Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. 
In Proceedings of the 22nd Annual Conference on Learning Theory. Omnipress, 2009. 

[10] M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Advances in 
Neural Information Processing Systems 22. MIT Press, 2009. 

[11] M. Herbster, M. Pontil, and S. Rojas-Galeano. Fast prediction on a tree. In Advances in Neural 
Information Processing Systems 22. MIT Press, 2009. 

[12] F. R. Kschischang, B. J. Frey, and H. A. Loeliger. Factor graphs and the sum-product algorithm. IEEE 
Transactions on Information Theory, 47(2):498-519, 2001. 

[13] R. Lyons and Y. Peres. Probability on trees and networks. Manuscript, 2008. 






[14] S. T. McCormick, M. R. Rao, and G. Rinaldi. Easy and difficult objective functions for max cut. Math. 
Program., 94(2-3):459-466, 2003. 

[15] G. Pandey, M. Steinbach, R. Gupta, T. Garg, and V. Kumar. Association analysis-based transformations 
for protein interaction networks: a function prediction case study. In Proceedings of the 13th ACM 
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 540-549. ACM 
Press, 2007. 

[16] Yahoo! Research and Laboratory of Web Algorithmics University of Milan. Web spam collection. 
http://barcelona.research.yahoo.net/webspam/datasets/ 

[17] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proc. of the 40th 
annual ACM symposium on Theory of computing (STOC 2008). ACM Press, 2008. 

[18] D.B. Wilson. Generating random spanning trees more quickly than the cover time. In Proceedings of 
the 28th ACM Symposium on the Theory of Computing, pages 296-303. ACM Press, 1996. 

[19] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic
functions. In Proceedings of the 20th International Conference on Machine Learning, 2003. 

Proof of Theorem 1 

Pick any $E' \subseteq E$ such that $\xi(M) = |E'|$. Let $F$ be the forest obtained by removing from $T$ all edges in
$E'$. Draw an independent random label for each of the $|E'| + 1$ components of $F$ and assign it to all nodes
of that component. Then any online algorithm makes in expectation at least half a mistake per component,
which implies that the overall number of online mistakes is at least $(|E'| + 1)/2 > \xi(M)/2$ in expectation. On the
other hand, $\Phi^W \le M$ clearly holds by construction.
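The quantity used in this proof can be computed greedily. The sketch below assumes that the function applied to M in the proof returns the largest number of edges of the tree whose total weight does not exceed M; this definition is given outside this section and is treated here as an assumption. Under it, taking the lightest edges first is optimal, and the chosen edges form a valid set E'.

```python
def xi(weights, M):
    """Largest number of edges whose total weight is at most M.

    weights: iterable of positive edge weights of the tree.
    Greedy choice: lightest edges first.
    """
    total, count = 0.0, 0
    for w in sorted(weights):
        if total + w > M:
            break
        total += w
        count += 1
    return count
```

With E' chosen this way, the adversary in the proof forces at least (xi(weights, M) + 1) / 2 mistakes in expectation, while the weighted cutsize of any labeling that is constant on the components is at most the total weight of E', hence at most M.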

Proof of Theorem 2 

We first give additional definitions used in the analysis, then we present the main ideas, and finally we 
provide full details. 

Recall that, given a labeled tree $(T, y)$, a cluster is any maximal subtree whose nodes have the same
label. Let $\mathcal{C}$ be the set of all clusters of $T$. For any cluster $C \in \mathcal{C}$, let $M_C$ be the subset of all nodes of $C$
on which SHAZOO makes a mistake. Let $\bar{C}$ be the subtree of $T$ obtained by adding to $C$ all nodes that are
adjacent to a node of $C$. Note that all edges connecting a node of $\bar{C} \setminus C$ to a node of $C$ are $\phi$-edges. Let
$E^\phi_{\bar C}$ be the set of $\phi$-edges in $\bar{C}$ and let $\Phi_{\bar C} = |E^\phi_{\bar C}|$. Let $\Phi^W_{\bar C}$ be the total weight of the edges in $E^\phi_{\bar C}$. Finally,
recall the notation $R^W_L = \sum_{(i,j) \in L} \frac{1}{w_{i,j}}$, where $L$ is any line graph.
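The definitions above can be made concrete with a short sketch that, given a labeled weighted tree, returns its clusters, the edges joining differently labeled nodes, and the total weight of those edges. The function name and the edge-list representation are illustrative assumptions.

```python
def clusters_and_cut(n, edges, labels):
    """Clusters, cut edges, and weighted cutsize of a labeled tree.

    n: number of nodes (0..n-1); edges: list of (u, v, w); labels: list of +1/-1.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    cut_edges = []
    for u, v, w in edges:
        if labels[u] == labels[v]:
            parent[find(u)] = find(v)      # same label: same cluster
        else:
            cut_edges.append((u, v, w))    # endpoints disagree: a cut edge
    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), []).append(v)
    weighted_cutsize = sum(w for _, _, w in cut_edges)
    return list(clusters.values()), cut_edges, weighted_cutsize
```

On a tree, the number of clusters is always one more than the number of cut edges, which is the counting used in the proof of Theorem 1.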

Recall that an in-cluster line graph is any line graph that is entirely contained in a single cluster. The
main idea used in the proof below is to bound $|M_C|$ for each $C \in \mathcal{C}$ in the following way. We partition $M_C$
into $O(|E'_{\bar C}|)$ groups, where $E'_{\bar C} \subseteq E_{\bar C}$. Then we find a set $\mathcal{L}_C$ of edge-disjoint in-cluster line graphs, and
create a bijection between lines in $\mathcal{L}_C$ and groups in $M_C$. We prove that the cardinality of each group is at
most $m_L = \min\{|L|, 1 + \lfloor \ln(1 + \Phi^W R^W_L) \rfloor\}$, where $L \in \mathcal{L}_C$ is the associated line. This shows that the
subset $M_T$ of nodes in $T$ which are mispredicted by SHAZOO satisfies

$$|M_T| = \sum_{C \in \mathcal{C}} |M_C| \le \sum_{C \in \mathcal{C}} \sum_{L \in \mathcal{L}_C} m_L = \sum_{L \in \mathcal{L}_T} m_L$$

where $\mathcal{L}_T = \bigcup_{C \in \mathcal{C}} \mathcal{L}_C$. Then we show that

$$\sum_{C \in \mathcal{C}} \sum_{(i,j) \in E'_{\bar C}} w_{i,j} = O(\Phi^W).$$

By the very definition of $\xi$, and using the bijection stated above, this implies

$$|\mathcal{L}_T| = \sum_{C \in \mathcal{C}} |\mathcal{L}_C| = O\Big(\sum_{C \in \mathcal{C}} |E'_{\bar C}|\Big) = O(\xi(\Phi^W)),$$

thereby resulting in the mistake bound contained in Theorem 2.
The details of the proof require further notation.

According to the SHAZOO prediction rule, when $i_t$ is not a fork and $C(H(i_t)) \neq \emptyset$, the algorithm predicts
$y_{i_t}$ using the label of any $j \in C(H(i_t))$ closest to $i_t$. In this case, we call $j$ an r-node (reference node) for
$i_t$ and the pair $\{j, (j, v)\}$, where $(j, v)$ is the edge on the path between $j$ and $i_t$, an rn-direction (reference
node direction). We use the shorthand notation $i^*$ to denote an r-node for $i$. In the special case when all
connection nodes $i$ of the hinge tree containing $i_t$ have $\Delta(i) = 0$ (i.e., $C(H(i_t)) = \emptyset$), and $i_t$ is not a fork,
we call any closest connection node $j_0$ to $i_t$ an r-node for $i_t$ and we say that $\{j_0, (j_0, v)\}$ is an rn-direction
for $i_t$. Clearly, we may have more than one node of $M_C$ associated with the same rn-direction. Given any
rn-direction $\{j, (j, v)\}$, we call r-line (reference line) the line graph whose terminal nodes are $j$ and the first
(in chronological order) node $j_0 \in V$ for which $\{j, (j, v)\}$ is an rn-direction, where $(j, v)$ lies on the path
between $j_0$ and $j$. We denote such an r-line by $L(j, v)$.

In the special case where $j \in C$ and $j_0 \notin C$ we say that the r-line is associated with the $\phi$-edge of
$E^\phi_{\bar C}$ included in the line graph. In this case we denote such an r-line by $L(u, q)$, where $(u, q) \in E^\phi_{\bar C}$. Figure 2
gives a pictorial example of the above concepts.




Figure 2: We illustrate an example of r-node, rn-direction and r-line. The numbers near the edge lines
denote edge weights. In order to predict $y_{i_2}$, SHAZOO uses the r-node $i_1$ and the rn-direction $\{i_1, (i_1, v)\}$.
After observing $y_{i_2}$, the hinge line connecting $i_1$ with $i_2$ (the thick black line) is created, which is also an
r-line, since at the beginning of step $t = 2$ the algorithm used $\{i_1, (i_1, v)\}$. In order to predict $y_{i_3}$, we still
use the r-node $i_1$ and the rn-direction $\{i_1, (i_1, v)\}$. After the revelation of $y_{i_3}$, node $f$ becomes a fork.

We now cover $M_C$ (the subset of all nodes of $C \in \mathcal{C}$ on which SHAZOO makes a mistake) by the
following subsets:

• $M^F_C$ is the set of all forks in $M_C$.

• $M^{in}_C$ is the subset of $M_C$ containing the nodes $i$ whose reference node $i^*$ belongs to $C$ (if $i$ is a fork,
then $i^* = i$). Note that this set may have a nonempty intersection with the previous one.

• $M^{out}_C$ is the subset of $M_C$ containing the nodes $i$ such that $i^*$ does not belong to $C$.

5 We may also have $v = j_0$.

Two other structures that are relevant to the proof:

• $C^F$ is the subset of all forks $f \in V_C$ such that $\Delta(f) \le 0$ at some step $t$. Since we assume the cluster
label is $+1$ (see below), and since a fork $i_t \in V_C$ is mistaken only if $\Delta(i_t) \le 0$, we have $M^F_C \subseteq C^F$.

• $C^{F'}$ is the subset of all nodes in $M_C$ that, when revealed, create a fork that belongs to $C^F$. Since at
each time step at most one new fork can be created (footnote 6), we have $|C^{F'}| \le |C^F|$.

The proof of the theorem relies on the following sequence of lemmas that show how to bound the number
of mistakes made on a given cluster $C = (V_C, E_C)$. A major source of technical difficulties, which makes this
analysis different from and more complex than those of TreeOpt and WTA, is that on a weighted tree the value
of $\Delta(i)$ on forks $i$ can potentially change after each prediction.

Without loss of generality, from now on we assume all nodes in $C$ are labeled $+1$. Keeping this assumption
in mind is crucial to understand the arguments that follow.

For any node $i \in V_C$, let $\bar\Delta(i)$ be the value of $\Delta(i)$ when all nodes in $\bar{C} \setminus C$ are revealed.

Lemma 3 For any fork $f$ of $C$ and any step $t = 1, \ldots, n$, we have $\bar\Delta(f) \le \Delta(f)$.

Proof. For the sake of contradiction, assume A(/) > A(_f). Let be the maximal subtree of T rooted at 
/ such that no internal node of is revealed. Now, consider the cut given by the edges of E c belonging 
to the hinge lines of T*. This cut separates / from any revealed node labeled with —1. The size of this cut 
cannot be larger than <E>^\ By definition of A(-), this implies A(/) < <i>^. However, also A(/) cannot be 
larger than Because 

A(H)< E Wu = *g 

must hold independent of the set of nodes in Vc that are revealed before time t, this entails a contradiction. 

□ 

Let now $\xi_{\bar C}$ be the restriction of $\xi$ on the subtree $\bar{C}$, and let $D_C$ be the set of all distinct rn-directions
which the nodes of $M^{in}_C$ can be associated with. The next lemmas are aimed at bounding $|C^F|$ and $|D_C|$.
We first need to introduce the superset $D'_C$ of $D_C$. Then, we show that for any $C$ both $|D'_C|$ and $|C^F|$ are
linear in $\xi_{\bar C}(\Phi^W_{\bar C})$.

In order to do so, we need to take into account the fact that the sign of $\Delta$ for the forks in the cluster can
change many times during the prediction process. This can be done via Lemma 3, which shows that when
all labels in $\bar{C} \setminus C$ are revealed then, for every fork $f \in C$, the value $\Delta(f)$ does not increase. Thus, we get the
largest set $D_C$ when we assume that the nodes in $\bar{C} \setminus C$ are revealed before the nodes of $C$.

Given any cluster $C$, let $\sigma_C$ be the order in which the nodes of $\bar{C}$ are revealed. Let also $\sigma'_C$ be the
permutation in which all nodes in $C$ are revealed in the same order as $\sigma_C$, and all nodes in $\bar{C} \setminus C$ are
revealed at the beginning, in any order. Now, given any node revelation order $\sigma_C$, $D'_C$ can be defined by
describing the three types of steps involved in its incremental construction supposing $\sigma'_C$ was the actual node
revelation order.

6 In step $t$ a new fork $j$ is created when the number of edge-disjoint paths connecting $j$ to the labeled nodes increases. This event
occurs only when a new hinge line $\pi(i_t, j)$ is created. When this happens, the only node for which the number of edge-disjoint
paths connecting it to labeled nodes gets increased is the terminal node $j$ of the newly created hinge line.

1. After the first $|\bar{C} \setminus C| = \Phi_{\bar C}$ steps, $D'_C$ contains all node-edge pairs $\{i, (i, j)\}$ such that $i$ is a fork and
$(i, j)$ is an edge lying on a hinge line of $\bar{C}$. Recall that no node in $C$ is revealed yet.

2. For each step $t > \Phi_{\bar C}$ when a new fork $f$ is created such that $\Delta(f) \le 0$ just after the revelation of $y_{i_t}$,
we add to $D'_C$ the three node-edge pairs $\{f, (f, j)\}$, where the $(f, j)$ are the edges contained in the
three hinge lines terminating at $f$.

3. Let $s$ be any step where: (i) a new hinge line $\pi(i_s, i^*_s)$ is created, (ii) node $i^*_s$ is a fork, and (iii)
$\Delta(i^*_s) \le 0$ at time $s - 1$. On each such step we add $\{i^*_s, (i^*_s, j)\}$ to $D'_C$, for $j$ in $\pi(i_s, i^*_s)$.

It is easy to verify that, given any ordering $\sigma_C$ for the node revelation in $\bar{C}$, we have $D_C \subseteq D'_C$. In fact,
given an rn-direction $\{i, (i, j)\} \in D_C$, if $(i, j)$ lies along one of the hinge lines that are present at time $\Phi_{\bar C}$
according to $\sigma'_C$, then $\{i, (i, j)\}$ must be included in $D'_C$ during one of the steps of type 1 above; otherwise
$\{i, (i, j)\}$ will be included in $D'_C$ during one of the steps of type 2 or type 3.

As announced, the following lemmas show that $|D'_C|$ and $|C^F|$ are both of the order of $\xi_{\bar C}(\Phi^W_{\bar C})$.

Lemma 4 (i) The total number of forks at time $t = \Phi_{\bar C}$ is $O(\xi_{\bar C}(\Phi^W_{\bar C}))$. (ii) The total number of elements
added to $D'_C$ in the first step of its construction is $O(\xi_{\bar C}(\Phi^W_{\bar C}))$.

Proof. Assume nodes are revealed according to $\sigma'_C$. Let $C'$ be the subtree of $\bar{C}$ made up of all nodes in
$\bar{C}$ that are included in any path connecting two nodes of $\bar{C} \setminus C$. By their very definition, the forks at time
$t = \Phi_{\bar C}$ are the nodes of $V_C$ having degree larger than two in subtree $C'$. Consider $C'$ as rooted at an
arbitrary node of $\bar{C} \setminus C$. The number of the leaves of $C'$ is equal to $|\bar{C} \setminus C| - 1$. This is in turn $O(\xi_{\bar C}(\Phi^W_{\bar C}))$
because $|\bar{C} \setminus C| = \Phi_{\bar C}$ and, by definition of $\xi$, $\Phi_{\bar C} \le \xi_{\bar C}(\Phi^W_{\bar C})$.

Now, in any tree, the sum of the degrees of nodes having degree larger than two is at most linear in
the number of leaves. Hence, at time $t = \Phi_{\bar C}$ both the number of forks in $C$ and the cardinality of $D'_C$ are
$O(\xi_{\bar C}(\Phi^W_{\bar C}))$. □
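The degree-counting fact invoked at the end of this proof can be checked directly. The derivation below is a standard argument written out for convenience; the symbols $n$ (number of nodes), $\ell$ (number of leaves), and $B$ (set of nodes of degree at least three) are introduced here only for illustration.

```latex
% In any tree with n >= 2 nodes, the degrees sum to 2(n-1), hence
\sum_{v} (\deg v - 2) = 2(n-1) - 2n = -2 .
% Leaves contribute -1 each and degree-two nodes contribute 0, so
\sum_{v \in B} (\deg v - 2) = \ell - 2 .
% For \deg v \ge 3 we have \deg v \le 3(\deg v - 2), therefore
\sum_{v \in B} \deg v \;\le\; 3 \sum_{v \in B} (\deg v - 2) \;=\; 3(\ell - 2) .
```

In particular, both the number of nodes of degree larger than two and the number of edges incident to them are $O(\ell)$, which is what the bounds on the number of forks and on the cardinality of the set of node-edge pairs require.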

Let now $\Gamma^C_t$ be the minimal cutsize of $T$ consistent with the labels seen before step $t + 1$, and notice that
$\Gamma^C_t$ is nondecreasing with $t$.

Lemma 5 Let $t$ be a step when a new hinge line $\pi(i_t, q)$ is created such that $i_t, q \in V_C$. If just after step $t$
we have $\Delta(q) \le 0$, then $\Gamma^C_t - \Gamma^C_{t-1} \ge w_{u,v}$, where $(u, v)$ is the lightest edge on $\pi(i_t, q)$.

Proof. Since $\Delta(q) \le 0$ and $\pi(i_t, q)$ is completely included in $C$, we must have $\Delta(q) \le 0$ just before the
revelation of $y_{i_t}$. This implies that the difference $\Gamma^C_t - \Gamma^C_{t-1}$ cannot be smaller than the minimum cutsize
that would be created on $\pi(i_t, q)$ by assigning label $-1$ to node $q$. □

Lemma 6 Assume nodes are revealed according to $\sigma'_C$. Then the cardinality of $C^F$ and the total number of
elements added to $D'_C$ during the steps of type 2 above are both linear in $\xi_{\bar C}(\Phi^W_{\bar C})$.

Proof. Let $\bar{C}^F$ be the set of forks in $V_C$ such that $\bar\Delta(f) \le 0$ at some time $t \le |V|$. Recall that, by definition,
for each fork $f \in C^F$ there exists a step $t_f$ such that $\Delta(f) \le 0$. Hence, Lemma 3 implies that, at the same
step $t_f$, for each fork $f \in C^F$ we have $\bar\Delta(f) \le 0$. Since $C^F$ is included in $\bar{C}^F$, we can bound $|C^F|$ by
$|\bar{C}^F|$, i.e., by the number of forks $i \in V_C$ such that $\bar\Delta(i) \le 0$, under the assumption that $\sigma'_C$ is the actual
revelation order for the nodes in $\bar{C}$.

Now, $|\bar{C}^F|$ is bounded by the number of forks created in the first $|\bar{C} \setminus C| = \Phi_{\bar C}$ steps, which is equal to
$O(\xi_{\bar C}(\Phi^W_{\bar C}))$, plus the number of forks $f$ created at some later step and such that $\bar\Delta(f) \le 0$ right after their
creation. Since nodes in $\bar{C}$ are revealed according to $\sigma'_C$, the condition $\bar\Delta(f) > 0$ just after the creation of
a fork $f$ implies that we will never have $\bar\Delta(f) \le 0$ in later stages. Hence this fork $f$ belongs neither to $\bar{C}^F$
nor to $C^F$.

In order to conclude the proof, it suffices to bound from above the number of elements added to $D'_C$
in the steps of type 2 above. From Lemma 5 we can see that for each fork $f$ created at time $t$ such that
$\bar\Delta(f) \le 0$ just after the revelation of node $i_t$, we must have $|\Gamma^C_t - \Gamma^C_{t-1}| \ge w_{u,v}$, where $(u, v)$ is the lightest
edge in $\pi(i_t, f)$. Hence, we can injectively associate each element of $\bar{C}^F$ with an edge of $E_C$, in such a way
that the sum of the weights of these edges is bounded by $\Phi^W_{\bar C}$. By definition of $\xi$, we can therefore conclude
that the total number of elements added to $D'_C$ in the steps of type 2 is $O(\xi_{\bar C}(\Phi^W_{\bar C}))$. □

With the following lemma we bound the number of nodes of $M^{in}_C \setminus C^{F'}$ associated with every
rn-direction and show that one can perform a transformation of the r-lines so as to make them edge-disjoint.
This transformation is crucial for finding the set $\mathcal{L}_T$ appearing in the theorem statement. Observe that, by
definition of r-line, we cannot have two r-lines such that each of them includes only one terminal node of the
other. Thus, let now $F_C$ be the forest where each node is associated with an r-line and where the parent-child
relationship expresses that (i) the parent r-line contains a terminal node of the child r-line, and (ii)
the parent r-line and the child r-line are not edge-disjoint. $F_C$ is, in fact, a forest of r-lines. We now use
$m_{L(j,v)}$ for bounding the number of mistakes associated with a given rn-direction $\{j, (j, v)\}$ or with a given
$\phi$-edge $(j, v)$. Given any connected component $T'$ of $F_C$, let finally $m_{T'}$ be the total number of nodes of
$M_C \setminus C^{F'}$ associated with the rn-directions $\{i, (i, j)\}$ of all r-lines $L(i, j)$ of $T'$.

Lemma 7 Let $C$ be any cluster. Then:

(i) The number of nodes in $M^{in}_C \setminus C^{F'}$ associated with a given rn-direction $\{j, (j, v)\}$ is of the order of
$m_{L(j,v)}$.

(ii) The number of nodes in $M^{out}_C \setminus C^{F'}$ associated with a given $\phi$-edge $(u, q)$ is of the order of $m_{L(u,q)}$.

(iii) Let $L(j_r, v_r)$ be the r-line associated with the root of any connected component $T'$ of $F_C$. Then $m_{T'}$ must
be at most of the same order as

$$\sum_{L(j,v) \in \mathcal{L}(L(j_r, v_r))} m_{L(j,v)} + |V_{T'}|,$$

where $\mathcal{L}(L(j_r, v_r))$ is a set of $|V_{T'}|$ edge-disjoint line graphs completely contained in $L(j_r, v_r)$.

Proof. We will prove only (i) and (iii), (ii) being similar to (i). Let $i_t$ be a node in $M^{in}_C \setminus C^{F'}$ associated
with a given rn-direction $\{j, (j, v)\}$. There are two possibilities: (a) $i_t$ is in $L(j, v)$ or (b) the revelation of
$y_{i_t}$ creates a fork $f$ in $L(j, v)$ such that $\Delta(f) > 0$ for all steps $s > t$. Let now $i_{t'}$ be the next node (in
chronological order) of $M^{in}_C \setminus C^{F'}$ associated with $\{j, (j, v)\}$. The length of $\pi(i_{t'}, i_t)$ cannot be smaller than
the length of $\pi(i_{t'}, j)$ (under condition (a)) or smaller than the length of $\pi(f, j)$ (under condition (b)).

This clearly entails a dichotomic behaviour in the sequence of mistaken nodes in $M^{in}_C \setminus C^{F'}$ associated
with $\{j, (j, v)\}$. Let now $p$ be the node in $L(j, v)$ which is farthest from $j$ such that the length of $\pi(p, j)$ is
not larger than $\Phi^W$. Once a node in $\pi(p, j)$ is revealed or becomes a fork $f$ satisfying $\Delta(f) > 0$ for all steps
$s > t$, we have $\Delta(j) > 0$ for all subsequent steps (otherwise, this would contradict the fact that the total
cutsize of $T$ is $\Phi^W$). Combined with the above sequential dichotomic behavior, this shows that the number
of nodes of $M^{in}_C \setminus C^{F'}$ associated with a given rn-direction $\{j, (j, v)\}$ can be at most of the order of

$$\min\Big\{ |L(j,v)|,\ 1 + \Big\lfloor \log_2\big(1 + \Phi^W R^W_{L(j,v)}\big) \Big\rfloor \Big\},$$

which is of the order of $m_{L(j,v)}$.

Part (iii) of the statement can now be proved in the following way. Suppose that an r-line $L(j, v)$,
having $j$ and $j_0$ as terminal nodes, includes the terminal node $j'$ of another r-line $L(j', v')$, having $j'$ and $j'_0$
as terminal nodes. Assume also that the two r-lines are not edge-disjoint. If $L(j', v')$ is partially included in
$L(j, v)$, i.e., if $j'_0$ does not belong to $L(j, v)$, then $L(j', v')$ can be broken into two sub-lines: the first one has
$j'$ and $k$ as terminal nodes, $k$ being the node in $L(j, v)$ which is farthest from $j'$; the second one has $k$ and
$j'_0$ as terminal nodes. It is easy to see that $L(j, v)$ must be created before $L(j', v')$ and $j'_0$ is the only node of
the second sub-line that can be associated with the rn-direction $\{j', (j', v')\}$. This observation reduces the
problem to considering that in $T'$ each r-line that is not a root is completely included in its parent.

Given an r-line $L(u, q)$ having $u$ and $z$ as terminals, we denote by $m_{\pi(u,z)}$ the quantity $m_{L(u,q)}$.
Consider now the simplest case in which $T'$ is formed by only two r-lines: the parent r-line $L(j_p, v_p)$,
which completely contains the child r-line $L(j_c, v_c)$. Let $s$ be the step in which the first node $u$ of $L(j_p, v_p)$
becomes a hinge node. After step $s$, $L(j_p, v_p)$ can be viewed as broken in two edge-disjoint sublines having
$\{j_p, u\}$ and $\{j_0, u\}$ as terminal node sets, where $j_0$ is one of the terminals of $L(j_p, v_p)$. Thus,

$$m_{T'} \le \max_{u} \; m_{\pi(j_p, u)} + m_{\pi(u, j_0)} + 1 .$$

Generalizing this argument for every component $T'$ of $F_C$, and using the above observation about the
partially included r-lines, we can state that, for any component $T'$ of $F_C$, $m_{T'}$ is of the order of

$$\max_{u_1, \ldots, u_N \in V_{L(j_p, v_p)}} \Big( m_{\pi(j_p, u_1)} + m_{\pi(u_N, j_0)} + \sum_{k=1}^{N-1} m_{\pi(u_k, u_{k+1})} + 2|V_{T'}| \Big),$$

where $N = |V_{T'}| - 1$. This entails that we can define $\mathcal{L}(L(j_r, v_r))$ as the union of $\{\pi(j_p, u_1), \pi(u_N, j_0)\}$
and $\bigcup_{k=1}^{N-1} \{\pi(u_k, u_{k+1})\}$, which concludes the proof.

□



Lemma 8 The total number of elements added to $D'_C$ during steps of type 3 above is of the order of $\xi_{\bar C}(\Phi^W_{\bar C})$.

Proof. Assume nodes are revealed according to $\sigma'_C$, and let $s$ be any type-3 step when a new element is
added to $D'_C$. There are two cases: (a) $\Delta(i^*_s) \le 0$ at time $s$ or (b) $\Delta(i^*_s) > 0$ at time $s$.

Case (a). Lemma 5, combined with the fact that all hinge lines created are edge-disjoint, ensures that
we can injectively associate each of these added elements with an edge of $E_C$ in such a way that the total
weight of these edges is bounded by $\Phi^W_{\bar C}$. This in turn implies that the total number of elements added to
$D'_C$ is $O(\xi_{\bar C}(\Phi^W_{\bar C}))$.

Case (b). Since we assumed that nodes are revealed according to $\sigma'_C$, we have that $\Delta(i^*_s)$ is positive for
all steps $t > s$. Hence case (b) can occur only once for each such fork $i^*_s$. Since this kind
of fork belongs to $C^F$, we can use Lemma 6 and conclude that (b) can occur at most $|C^F| = O(\xi_{\bar C}(\Phi^W_{\bar C}))$
times. □



Lemma 9 With the notation introduced so far, we have $|D_C| = O(\xi_{\bar C}(\Phi^W_{\bar C}))$.

Proof. Combining Lemma 4, Lemma 6, and Lemma 8, we immediately have $|D'_C| = O(\xi_{\bar C}(\Phi^W_{\bar C}))$. The
claim then follows from $D_C \subseteq D'_C$. □

We are now ready to prove the theorem.

Proof of Theorem 2. Let $F_T$ be the union of $F_C$ over $C \in \mathcal{C}$. Using Lemma 9 we deduce $|V_{F_C}| =
\Phi_{\bar C} + O(\xi_{\bar C}(\Phi^W_{\bar C})) = O(\xi_{\bar C}(\Phi^W_{\bar C}))$, where the term $\Phi_{\bar C}$ takes into account that at most one r-line of $F_C$
may be associated with each $\phi$-edge of $\bar{C}$.

By definition of $\xi(\cdot)$, this implies $|V_{F_T}| = O(\xi(\Phi^W))$. Using parts (i) and (ii) of Lemma 7, and taking
each of the sets below over all clusters, we have

$$|M_T| \le |M^F| + |M^{in}| + |M^{out}| \le |C^F| + |C^{F'}| + \sum_{L \in V_{F_T}} m_L \le \sum_{L \in V_{F_T}} m_L + O(\xi(\Phi^W)).$$

Let now $\mathcal{T}(F_T)$ be the set of components of $F_T$. Given any tree $T' \in \mathcal{T}(F_T)$, let $r(T')$ be the r-line root
of $T'$. Recall that, by part (iii) of Lemma 7, for any tree $T' \in \mathcal{T}(F_T)$ we can find a set $\mathcal{L}(r(T'))$ of $|V_{T'}|$
edge-disjoint line graphs all included in $r(T')$ such that $m_{T'}$ is of the order of $\sum_{L \in \mathcal{L}(r(T'))} m_L + |V_{T'}|$.
Let now $\mathcal{L}'_T$ be equal to $\bigcup_{T' \in \mathcal{T}(F_T)} \mathcal{L}(r(T'))$. Thus we have

$$|M_T| = O\Big(\sum_{L \in \mathcal{L}'_T} m_L + |V_{F_T}| + \xi(\Phi^W)\Big) = O\Big(\sum_{L \in \mathcal{L}'_T} m_L + \xi(\Phi^W)\Big).$$

Observe that $\mathcal{L}'_T$ is not an edge-disjoint set of line graphs included in $T$ only because each $\phi$-edge may belong
to two different lines of $\mathcal{L}'_T$. By definition of $m_L$, for any line graphs $L$ and $L'$, where $L'$ is obtained from $L$
by removing one of the two terminal nodes and the edge incident to it, we have $m_{L'} = m_L + O(1)$. If, for
each $\phi$-edge shared by two line graphs of $\mathcal{L}'_T$, we shorten the two line graphs so that neither of them includes
the $\phi$-edge, we obtain a new set of edge-disjoint line graphs $\mathcal{L}_T$ such that $\sum_{L \in \mathcal{L}'_T} m_L = O\big(\sum_{L' \in \mathcal{L}_T} m_{L'}\big)$.

Hence, we finally obtain $|M_T| = O\big(\sum_{L' \in \mathcal{L}_T} m_{L'} + \xi(\Phi^W)\big) = O\big(\sum_{L' \in \mathcal{L}_T} m_{L'}\big)$, where in the last
equality we used the fact that $m_{L'} \ge 1$ for all line graphs $L'$. □